CRIM 515 Final Project Template
02 December 2024
- 1. Research Question
- 2. Data
- 3. Methods
- 4. Findings
- Writing Tips
- Other Information/Reminders
- Grading: 30pts
The Final Project follows the same general flow as the Research Projects, as outlined below. Use this format as much or as little as you’d like, as long as the required components are covered.
1. Research Question
The goal of this project is to forecast future calls for service activity in the City of Fairfax, Virginia for calendar year 2025.
These instructions will demonstrate how to forecast using an ARIMA model. An ARIMA model is an Autoregressive Integrated Moving Average, which is a univariate model predicting future values of a single variable over time. Let’s breakdown these terms:
- Autoregressive: “auto” means self, and “regressive” refers to using prior values.
- Integrated: “integrated” refers to the process of making data stationary, which identifies trends by differencing the data
- Moving Average: “moving average” is a calculation for a series of averages using subsets of data to identify trends over time. It is a more precise measure for smoothing out fluctuations.
Your research question should consider behavior, space, and time. You’ll either measure or control for each of these.
- Behavior: what call type(s)
- Space: where
- Time: when is the data from, when are you forecasting
For example, using 2008-2024 data, forecast trespass calls for service counts and locations in the City of Fairfax, VA for each month in 2025.
2. Data
Acquire
- Load the libraries you need for success:
library(tidyverse)
library(leaflet)
library(sf)
library(tigris)
library(zoo)
library(aTSA)
library(tseries)
library(forecast)- Get the latest and greatest Fairfax calls for service data:
calls.full <- read.csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
"1ti7BCMe1vvDVVr75p6lqvgouCy_ae7rn"))Wrangle
- Format the date, as a date, and add month columns (different styles) for later:
calls.full$DATE <- as.Date(calls.full$date, format = "%Y-%m-%d")
calls.full$MONTHS <- months(calls.full$DATE)
calls.full$MONTH <- month(calls.full$DATE)- Determine the total count and percentage of each call type:
sum.calls <- calls.full %>%
group_by(type) %>%
summarise(call.count = n()) %>%
mutate(pct = round(call.count/sum(call.count)*100,2))- Use the table from #4 to find and filter the call types to forecast (choose your own!):
group <- c("SUSPICIOUS", "MISC")
calls.subset <- calls.full %>% filter(calls.full$type %in% group)3. Methods
Building a relevant methodology is largely dependent on your research question and data. Any good methodology involves multiple steps for acquiring (done above), wrangling (done above), calculating, visualizing, and analyzing data. There is no single right answer here; rather, its the ability to craft a unique, distinct workflow that results in innovative work.
Some important considerations for how you use your data include:
- SPACE:
- Overall: Fairfax, VA
- Unit of Measurement: Neighborhoods? Streets?
- TIME:
- Overall: time range for your datas
- Unit of Measurement: Days, Months, Years
- BEHAVIOR:
- Overall: Types, categories
- Unit of Measurement: Call type(s)
To use an ARIMA model, we will identify values for p, d, and q:
- p: Lagged observations
- d: Differencing the data set
- q: Prior error handling
Calculate
- Calculate calls per day:
calls <- calls.subset %>%
group_by(DATE) %>%
summarise(CALL.COUNT = n())- Fill in blank days:
calls <- calls %>%
complete(DATE = seq.Date(min(DATE), max(DATE), by = "day")) %>%
mutate(WEEKDAY = lubridate::wday(DATE, label = T, week_start = 1),
MONTH = lubridate::month(DATE, label = T, abbr = F),
WEEK = isoweek(DATE),
DAY = day(DATE),
YEAR = year(DATE))
# replace the NAs with 0s
calls <- replace(calls, is.na(calls), 0)- Change the data type to a proper time series, update the sequencing:
cleaned.calls <- zoo(calls$CALL.COUNT,
seq(from = as.Date(min(calls$DATE)),
to = as.Date(max(calls$DATE)), by = 1))- Generate basic summary stats and a basic graph:
summary(cleaned.calls)## Index cleaned.calls
## Min. :2007-01-01 Min. : 0.000
## 1st Qu.:2011-06-17 1st Qu.: 4.000
## Median :2015-12-01 Median : 6.000
## Mean :2015-12-01 Mean : 6.186
## 3rd Qu.:2020-05-16 3rd Qu.: 8.000
## Max. :2024-10-31 Max. :37.000
plot(cleaned.calls)
title("My Calls Per Day") # this adds a title to your graph- Make the data stationary (normalized, as a deviation from previous values), and a new basic graph. You can do more/higher differences as necessary (but probably don’t need to). Examine the graphs and compare the spikes. These are just examples:
stationary1 <- diff(cleaned.calls, differences = 1)
plot(stationary1)stationary2 <- diff(cleaned.calls, differences = 2)
plot(stationary2)stationary3 <- diff(cleaned.calls, differences = 3)
plot(stationary3)Look for the data with the narrowest y axis range, as consistently as close to 0 as possible. In this example, stationary1 is the best dataset.
- Conduct an Augmented Dickey-Fuller test to determine if the data is stationary. The resultant score should be a negative number, and the lower the number the stronger the data is stationary. In this example, ‘cleaned.calls’ is run as a comparison to data that is not differenced.
adf.test(as.matrix(cleaned.calls))## Warning in adf.test(as.matrix(cleaned.calls)): p-value smaller than printed
## p-value
##
## Augmented Dickey-Fuller Test
##
## data: as.matrix(cleaned.calls)
## Dickey-Fuller = -8.6457, Lag order = 18, p-value = 0.01
## alternative hypothesis: stationary
adf.test(as.matrix(stationary1))## Warning in adf.test(as.matrix(stationary1)): p-value smaller than printed
## p-value
##
## Augmented Dickey-Fuller Test
##
## data: as.matrix(stationary1)
## Dickey-Fuller = -30.532, Lag order = 18, p-value = 0.01
## alternative hypothesis: stationary
adf.test(as.matrix(stationary2))## Warning in adf.test(as.matrix(stationary2)): p-value smaller than printed
## p-value
##
## Augmented Dickey-Fuller Test
##
## data: as.matrix(stationary2)
## Dickey-Fuller = -41.14, Lag order = 18, p-value = 0.01
## alternative hypothesis: stationary
adf.test(as.matrix(stationary3))## Warning in adf.test(as.matrix(stationary3)): p-value smaller than printed
## p-value
##
## Augmented Dickey-Fuller Test
##
## data: as.matrix(stationary3)
## Dickey-Fuller = -49.399, Lag order = 18, p-value = 0.01
## alternative hypothesis: stationary
In this example, all scenarios are negative. When combined with the plot analysis in Step 5, stationary1 is best. Ideally you want to manipulate the data as little as possible. While ‘stationary2’ and ‘stationary3’ are even more negative, they are not negative enough to warrant choosing them.
- Lagged autocorrelations - specifically using the Autocorrelation (ACF) and Partial Autocorrelation (PACF) functions - will help determine the p and q values. For both calculations and visuals, you will use the best stationary dataset you chose in Steps 5 and 6 above.
- Determine the p order by examining the PACF graph and data. Specifically, look for the last significant lag:
pacf(stationary1)pacf(stationary1, pl=FALSE)##
## Partial autocorrelations of series 'stationary1', by lag
##
## 1 2 3 4 5 6 7 8 9 10 11
## -0.452 -0.347 -0.242 -0.217 -0.170 -0.118 -0.092 -0.113 -0.094 -0.066 -0.050
## 12 13 14 15 16 17 18 19 20 21 22
## -0.088 -0.149 -0.057 -0.018 -0.031 -0.032 -0.057 -0.046 -0.029 -0.051 -0.040
## 23 24 25 26 27 28 29 30 31 32 33
## -0.026 0.019 -0.038 -0.048 -0.080 -0.030 0.009 -0.004 -0.025 -0.036 -0.028
## 34 35 36 37 38
## -0.015 -0.022 -0.037 -0.009 0.017
- Determine the q order by identifying the last significant lag in the ACF:
acf(stationary1)acf(stationary1, pl=FALSE)##
## Autocorrelations of series 'stationary1', by lag
##
## 0 1 2 3 4 5 6 7 8 9 10
## 1.000 -0.452 -0.072 0.031 -0.019 0.016 0.015 -0.009 -0.025 0.018 0.011
## 11 12 13 14 15 16 17 18 19 20 21
## -0.004 -0.030 -0.021 0.082 -0.011 -0.034 0.004 -0.015 0.020 0.011 -0.024
## 22 23 24 25 26 27 28 29 30 31 32
## 0.008 0.008 0.021 -0.052 0.010 -0.010 0.048 0.007 -0.037 -0.010 0.004
## 33 34 35 36 37 38
## 0.018 0.007 -0.017 -0.009 0.028 0.008
In this example, the p should be 8, and the q should be 1.
- Build an ARIMA function using the inputs derived from previous steps (based on analysis from the graph in Step 4, it’s reasonable to assume our data has season trends - so we’ll set it to TRUE):
arima.calls <- auto.arima(cleaned.calls, d = 1, max.p = 8, max.q = 1, seasonal = T)
summary(arima.calls)## Series: cleaned.calls
## ARIMA(2,1,1)
##
## Coefficients:
## ar1 ar2 ma1
## 0.0600 -0.0350 -0.9519
## s.e. 0.0132 0.0132 0.0049
##
## sigma^2 = 8.431: log likelihood = -16183.71
## AIC=32375.41 AICc=32375.42 BIC=32402.54
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -0.007592036 2.90269 2.228652 -Inf Inf 0.7497139 0.0006917396
Note that the ‘auto.arima’ function runs a series of ARIMA models and choose the best fit for the data. The best fitting model can be interpreted as ARIMA(d, p, q).
- Do a quick residuals checks on the model to see how it performed:
checkresiduals(arima.calls)##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,1)
## Q* = 14.957, df = 7, p-value = 0.03655
##
## Model df: 3. Total lags used: 10
Proper residuals derived from your model include:
- the top graph not demonstrating any clear, cyclical, repeating patterns
- the bottom right graph generally being a normally distributed bell curve
- Use the ARIMA function to do a sample forecast for the next week of data:
# h = the number of units of time to measure
forecast.7days <- forecast(arima.calls, h=7)
round(sum(forecast.7days$upper[,2]),0)## [1] 74
forecast.7days$mean## Time Series:
## Start = 20028
## End = 20034
## Frequency = 1
## [1] 4.763555 4.819397 4.831025 4.829767 4.829285 4.829300 4.829318
- Identify the number of days between the last date in your dataset and the end of 2025:
forecast.window <- as.numeric(as.Date("2025-12-31")-max(calls$DATE))- Forecast the number of calls per day for everyday in that window, and plot it:
forecast.2025 <- forecast(arima.calls, h=forecast.window)
autoplot(forecast.2025)- Extract the forecasted values as a table, clean up the column names, add the forecast date, and month:
forecast.values <- as.data.frame(forecast.2025$mean)
forecast.values$ID <- seq.int(nrow(forecast.values))
forecast.upper <- as.data.frame(forecast.2025$upper)
forecast.upper$ID <- seq.int(nrow(forecast.upper))
forecast.values <- forecast.values %>%
left_join(forecast.upper, by = 'ID')
colnames(forecast.values) <- c("MEAN", "ID", "CI80", "CI90")
forecast.values$DATE <- as.Date(max(calls$DATE) + forecast.values$ID)
forecast.values$MONTH <- months(forecast.values$DATE)- Filter to 2025, summarize forecasts by month:
forecast.values.2025 <- subset(forecast.values, forecast.values$DATE > '2024-12-31')
forecast.months <- forecast.values.2025 %>%
group_by(MONTH) %>%
summarise(MEAN = round(sum(MEAN),0), FORECAST.90 = round(sum(CI90),0), FORECAST.80 = round(sum(CI80),0))
forecast.months$DIFF <- forecast.months$FORECAST.90 - forecast.months$FORECAST.80Visualize
- Graph of calls with a trend line:
graph.calls <- ggplot(calls, aes(x=DATE, y=CALL.COUNT)) +
geom_line() +
scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
xlab("Years") +
ylab("Call Count") +
ggtitle("TITLE HERE") +
geom_area(fill="lightblue", color="black")
graph.calls +
geom_smooth(method = lm, col = "red", se = FALSE) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))- Graph actual calls with predicted calls:
calls2024 <- calls[c(1,2)]
calls2025 <- forecast.values[c(5,1)]
names(calls2025) <- c("DATE", "CALL.COUNT")
new.calls <- rbind(calls2024, calls2025)
graph.new.calls <- ggplot(new.calls, aes(x=DATE, y=CALL.COUNT)) +
geom_line() +
scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
xlab("Years") +
ylab("Call Count") +
ggtitle("TITLE HERE") +
geom_area(fill="lightblue", color="black")
graph.new.calls +
geom_smooth(method = lm, col = "red", se = FALSE) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))## `geom_smooth()` using formula = 'y ~ x'
- Tweak the data table for making a graph:
forecast.months$MONTH <- factor(forecast.months$MONTH, levels = forecast.months$MONTH)
forecast.long <- pivot_longer(forecast.months, cols = c(FORECAST.80, DIFF),
names_to = "Category", values_to = "Value")
forecast.long$Category <- gsub("DIFF", "90% Confidence Interval", forecast.long$Category)
forecast.long$Category <- gsub("FORECAST.80", "80% Confidence Interval", forecast.long$Category)- Graph your monthly forecasts, first for the upper bounds, and then for the mean:
ggplot(forecast.long, aes(x = MONTH, y = Value, fill = fct_rev(Category))) +
geom_bar(stat = "identity") +
geom_text(aes(label = Value), size = 3, colour = 'white', position = position_stack(vjust = 0.5)) +
labs(title = "City of Fairfax 2025 Monthly Forecast Upper Bounds",
x = "Month",
y = "Call Count") +
scale_fill_manual(values = c("90% Confidence Interval" = "blue",
"80% Confidence Interval" = "grey"),
name = "Forecasts") +
scale_x_discrete(limits = c("January", "February", "March", "April", "May", "June", "July", "August",
"September", "October", "November", "December")) +
coord_cartesian(ylim = c(0, 500)) +
theme_classic() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))ggplot(forecast.months, aes(x = MONTH, y = MEAN)) +
geom_bar(stat = "identity") +
scale_x_discrete(limits = c("January", "February", "March", "April", "May", "June", "July", "August",
"September", "October", "November", "December")) +
coord_cartesian(ylim = c(100, 175)) +
theme_classic() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "City of Fairfax 2025 Mean Monthly Forecast",
x = "Month",
y = "Call Count")- Map your monthly hotspots.
- Get the city outline and roads
fairfax.roads <- roads("VA", "Fairfax city")
fairfax.outline <- county_subdivisions("VA", "Fairfax city")- Filter your data by month
- Choose a hotspot method
- Facet by year
ggplot() +
geom_sf(data = fairfax.outline, color = "grey") +
geom_hex(aes(x = lon, y = lat), data = calls.subset, bins = 8) +
scale_fill_continuous(type = "viridis") +
geom_sf(data = fairfax.roads, color = "black", alpha = 0.4) +
theme_classic() +
facet_wrap(~ MONTH)- Repeat for multiple years to confirm any monthly patterns.
4. Findings
INSTRUCTIONS
- Describe the model
- Describe the outputs
- Contextualize the results
- Describe the hotspots
- Make it real
- Make it usable
Writing Tips
Related, some general notes about writing. Welcome to grad school, where the quality of your writing is held to a higher standard. Write directly. Thus, avoid using these words and phrases:
- Simply. Very. Indeed. Specifically. Perhaps. Arguably. Usually. Essentially. Could possibly. Could potentially. Necessarily. Only. Especially. In order to. Likely. Primarily. Generally. It is important that. It is essential to note.
Also, words have meanings. Some are stronger than others. Others aren’t real.
Further, here’s some other strategies to consider:
- Read sentences aloud… but don’t write like you speak.
- Ask yourself - does this sentence make sense, by itself.
- It’s ok to occasionally use quotes, don’t copy specific language or words.
- If you don’t know the meaning of a word, don’t use it!
And finally, this is just fun.
Other Information/Reminders
- When submitting your projects, you are submitting either an HTML or PDF output of your Markdown file, as an upload to Canvas - not via email.
- When you are ready to see what the output looks like, click the ‘Knit’ button. The output file will be saved to the same folder this script is saved to.
- Never ever ever include any install.packages functions. Either comment them out with a #, or delete them.
Grading: 30pts
1. Research Question
2pt
Provide a brief paragraph (1-2 sentences) on the research question. State the question, in terms of behavior, space, and time. Define key terms and concepts, to include how variables will be measured.
2. Literature Review
2pts
Provide a brief paragraph describing a prior study related to your current project. This study should be from a peer-academic academic journal. Include details about the research question, data, methods, and findings. Cite the article in APA format.
3. Data
1pt
Provide a brief paragraph on the data used in this study. This includes the specific source(s), and the specific temporal and spatial constraints. Address any major advantages and disadvantages with using these data compared to other sources. Be specific enough that a reader could replicate this work.
4. Methods
2pts
Provide a brief paragraph on the research methods leveraged to answer this question. This includes the software, calculations, skills, techniques, and unique workflows used to analyze your data and develop an answer. You do NOT need to describe click-by-click instructions or lines of code describing how you did things; you DO need to describe a logical process that is specific enough that a reader could replicate.
5. Findings
12pts
Provide one brief paragraph (4-5 sentences) per month forecasted (2pts each). Each paragraph should include forecasted counts, and specific areas of the city to focus on. Your findings should be descriptive, clear, and comprehensively answer the original question.
7. Visuals
10pts
Provide a graph that highlights your model (2pts).
Provide a graph showing the forecasted counts for each month in 2025. This graph should have a title and correct axis labels (2pts).
Provide hotspot maps (6pts).
All visuals should be thoughtful and logical (make sense to someone that has never seen them before); somehow, someway referenced in the text; and analytically meaningful (provide useful, relevant, and actionable insights)
8. Formatting
1pt
- Error-free writing. No typos, run-on sentences, or sentence fragments. Proper punctuation. Real words. Need help paraphrasing dense content? Try this. And here’s a great reference, too.
- File submitted as the knitted HTML output of an RMarkdown file, via Canvas.
ONE LAST THING… the source code for creating this file can be found here.
Please email me with any questions.